
Conversation

@lhoestq
Member

@lhoestq lhoestq commented Dec 17, 2024

  • link to blog post for real world use case
  • explain how dask parallelism works
  • recommend squash history after writing to HF
  • add a distributed data processing example
  • explain predicate and projection pushdown + example

@lhoestq lhoestq requested a review from davanstrien December 17, 2024 14:53
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@davanstrien davanstrien left a comment


Looks great! If you don't get a chance, I can also add something similar for Polars. I had one question about how Dask does the filtering for downloads. I may have misunderstood something here.


# Dask will skip the files or row groups that don't
# match the query without downloading them.
df = df[df.dump >= "CC-MAIN-2023"]
Member


Does Dask not still need to download the data to check that the values in this column match this filter? From what I understood, in the Polars case predicate pushdown is usually used for skipping the reading of a column, i.e. if you drop it later it doesn't bother to load it, and/or for doing a filtering step early on. Is Dask able to do this directly, before loading?

Member Author

@lhoestq lhoestq Dec 17, 2024


So it skips the row groups that don't have any rows matching the query, using the row-group metadata.

Then, on the remaining row groups, it downloads only the column used for filtering in order to apply the filter.

The other columns are downloaded or not depending on the other operations performed on the dataset.
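To make the first step concrete: a hypothetical sketch (not Dask's actual internals) of how row-group metadata enables skipping. Each Parquet row group stores min/max statistics per column, so a filter like `dump >= "CC-MAIN-2023"` can rule out whole row groups before any data is downloaded. The `row_groups` values and the `may_match` helper below are illustrative assumptions.

```python
# Toy stand-in for Parquet row-group metadata: min/max stats per column.
row_groups = [
    {"dump_min": "CC-MAIN-2021", "dump_max": "CC-MAIN-2021"},
    {"dump_min": "CC-MAIN-2022", "dump_max": "CC-MAIN-2023"},
    {"dump_min": "CC-MAIN-2023", "dump_max": "CC-MAIN-2024"},
]

def may_match(rg, lower_bound):
    # A row group can only contain matching rows for `dump >= lower_bound`
    # if its max value reaches the lower bound of the filter.
    return rg["dump_max"] >= lower_bound

# Only these row groups need to be downloaded at all; the rest are
# skipped using metadata alone.
to_download = [i for i, rg in enumerate(row_groups)
               if may_match(rg, "CC-MAIN-2023")]
print(to_download)  # → [1, 2] (row group 0 is skipped entirely)
```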

Member Author


if you drop it later it doesn't bother to load it and/or doing a filtering step early on. Is Dask directly able to do this before loading?

Yes, correct! It would be cool to explain that here as well.
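The projection-pushdown idea being confirmed here can be sketched as follows. This is not Dask's query planner, just a toy model under assumed names: each downstream step declares the columns it uses, and the reader only fetches the union of those columns.

```python
# Columns available in the (hypothetical) Parquet dataset.
all_columns = ["url", "dump", "text", "language"]

# Toy "plan": each step records which columns it actually touches.
# A real engine derives this from the expression graph.
plan = [
    {"op": "filter", "uses": ["dump"]},   # df[df.dump >= "CC-MAIN-2023"]
    {"op": "select", "uses": ["text"]},   # df["text"]
]

# Projection pushdown: read only the columns some step needs,
# in the dataset's column order.
needed = {col for step in plan for col in step["uses"]}
columns_to_read = [c for c in all_columns if c in needed]
print(columns_to_read)  # → ['dump', 'text']
```

Columns like `url` and `language` are never read, which is why dropping a column early (or never selecting it) saves download time.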

Member


That's super cool!! For some datasets the download time does seem to end up becoming a blocker so this is very neat!

Co-authored-by: Daniel van Strien <[email protected]>
@lhoestq
Member Author

lhoestq commented Dec 17, 2024

Thanks for the review! Merging this one for now, but let me know if you have more comments.

@lhoestq lhoestq merged commit 0ea864e into main Dec 17, 2024
2 checks passed
@lhoestq lhoestq deleted the update-dask-docs branch December 17, 2024 17:52


4 participants